Going to College is one of the biggest financial decisions a family or a young adult will have to make. Recently (as of late 2019) the US Department of Education made available the earnings by Field of Study for hundreds of institutions. The tool provided by the US Department of education:
https://collegescorecard.ed.gov/
It is really good, if the student has already decided the field of study or the institutions where he/she would like to apply. Having said this, the tool does not provide an easy way to compare the financial benefit of one degree versus another; or the financial benefit of choosing one institution versus all others that offer the same degree.
The US department of education , provides the option to download the data sets used in the tool from the following website:
https://collegescorecard.ed.gov/data/
The analsysis uses the data in the file FieldOfStudyData1516_1617_PP.csv; combining it with Institution information to display a financial comparison of programs. The Earning by Field of study data shows the Median Earnings (and Debt) of Students as of 1 year after graduation for the classes 2015_16 anf 2016_17.
The Data has been pre_processed already to show only the records for which Earnings and Tuition information is available. The data has 19,563 records with 14 columns.
FieldofStudy_Earnings_for_rep = read.csv("C:/Users/figue/Documents/Collegeboard_data/Field_of_Studies_Earn_rep.csv", stringsAsFactors = FALSE)
dim(FieldofStudy_Earnings_for_rep)
## [1] 19563 14
colnames(FieldofStudy_Earnings_for_rep)
## [1] "INSTNM" "MAIN" "CIPDESC"
## [4] "STABBR" "DEBTMEDIAN" "MD_EARN_WNE"
## [7] "ADM_RATE" "SAT_AVG" "Tuition_Average"
## [10] "Tuition_more_110K" "Debt_payback_YR" "Tuit_Avg_4Y_Pyback"
## [13] "Tuit_Inc_4Y_Pyback" "Salary_10Y_4Y_Tui"
The following is a brief description of the columns.
INSTNM: Institution Name.
MAIN: If reads 1 if it is the main campus for the institution otherwise reads 0.
CIPDESC: The description of the Field of Study.
STABBR: State where the Institution is located.
DEBTMEDIAN: The Median Amount of debt of studends graduating from that institution with that field of study.
MD_EARN_WNE: Median Earnings of the students of the field of study from the specific institution 1 year after graduation.
ADM_RATE: Overall Admission rate for the Institution.
SAT_AVG: SAT Average Score of all Students accepted by the Institution (independently of Field of Study).
Tuition_Average: Average Tuition paid by students at Institution (independently of Field of Study).
Tuition_more_110K: Average Tuition paid by students, belonging to families with incomes of more than 110K per year, at Institution (independently of Field of Study).
Debt_payback_YR: Number of years to pay median debt based on earnings 1 year after graduation (if all earnings were used to pay debt).
Tuit_Avg_4Y_Pyback: Number of years to payback 4 years of Average Tuition based on earnings 1 year after migration (if all earnings were used to pay debt).
Tuit_Inc_4Y_Pyback: Number of years to payback 4 years of Tuition (Family Income >110K) based on earnings 1 year after migration (if all earnings were used to pay debt).
Salary_10Y_4Y_Tui: Potential Total Earnings 10 Years after graduation after paying 4 Years of Average Tuition.
The last 4 columns are calculated columns, All others come directly from the college scorecard data files.
There are more than 300 Field of Studies for Bachelor Degrees however the most popular (more than 50 institutions offer these) are listed following.
FieldofStudy_Earnings_for_rep_count = aggregate(INSTNM ~ CIPDESC,data= FieldofStudy_Earnings_for_rep, length)
FieldofStudy_Earnings_for_rep_count=FieldofStudy_Earnings_for_rep_count[order(FieldofStudy_Earnings_for_rep_count$INSTNM,decreasing=TRUE),]
FieldofStudy_Earnings_for_rep_count[FieldofStudy_Earnings_for_rep_count$INSTNM>49,]
| CIPDESC | INSTNM | |
|---|---|---|
| 47 | Business Administration, Management and Operations. | 1117 |
| 234 | Psychology, General. | 871 |
| 244 | Registered Nursing, Nursing Administration, Nursing Research and Clinical Nursing. | 809 |
| 83 | Criminal Justice and Corrections. | 655 |
| 41 | Biology, General. | 648 |
| 1 | Accounting and Related Services. | 619 |
| 269 | Teacher Education and Professional Development, Specific Levels and Methods. | 571 |
| 62 | Communication and Media Studies. | 544 |
| 114 | English Language and Literature, General. | 456 |
| 188 | Marketing. | 414 |
| 145 | Health and Physical Education/Fitness. | 405 |
| 179 | Liberal Arts and Sciences, General Studies and Humanities. | 379 |
| 231 | Political Science and Government. | 374 |
| 260 | Sociology. | 366 |
| 126 | Fine and Studio Arts. | 356 |
| 125 | Finance and Financial Management Services. | 345 |
| 69 | Computer and Information Sciences, General. | 326 |
| 151 | History. | 319 |
| 258 | Social Work. | 290 |
| 270 | Teacher Education and Professional Development, Specific Subject Areas. | 287 |
| 195 | Mechanical Engineering. | 277 |
| 97 | Economics. | 272 |
| 91 | Design and Applied Arts. | 255 |
| 50 | Business/Commerce, General. | 240 |
| 144 | Health and Medical Administrative Services. | 219 |
| 73 | Computer Science. | 206 |
| 103 | Electrical, Electronics and Communications Engineering. | 205 |
| 58 | Civil Engineering. | 184 |
| 193 | Mathematics. | 171 |
| 94 | Drama/Theatre Arts and Stagecraft. | 167 |
| 206 | Natural Resources Conservation and Research. | 167 |
| 159 | Human Resources Management and Services. | 162 |
| 251 | Romance Languages, Literatures, and Linguistics. | 156 |
| 203 | Multi/Interdisciplinary Studies, Other. | 152 |
| 204 | Music. | 146 |
| 173 | Journalism. | 145 |
| 124 | Film/Video and Photographic Arts. | 139 |
| 240 | Public Relations, Advertising, and Applied Communication. | 135 |
| 54 | Chemical Engineering. | 129 |
| 238 | Public Health. | 126 |
| 262 | Special Education and Teaching. | 126 |
| 18 | Anthropology. | 125 |
| 158 | Human Development, Family Studies, and Related Services. | 125 |
| 154 | Hospitality Administration/Management. | 122 |
| 183 | Management Information Systems and Services. | 122 |
| 14 | Allied Health Diagnostic, Intervention, and Treatment Professions. | 119 |
| 55 | Chemistry. | 113 |
| 71 | Computer Engineering. | 111 |
| 163 | Information Science/Studies. | 105 |
| 147 | Health Services/Allied Health/Health Sciences, General. | 99 |
| 242 | Radio, Television, and Digital Communication. | 99 |
| 250 | Rhetoric and Composition/Writing Studies. | 97 |
| 84 | Criminology. | 87 |
| 171 | International Relations and National Security Studies. | 87 |
| 74 | Computer Software and Media Applications. | 83 |
| 27 | Architecture. | 73 |
| 184 | Management Sciences and Quantitative Methods. | 71 |
| 17 | Animal Sciences. | 69 |
| 77 | Computer/Information Technology Administration and Management. | 69 |
| 160 | Human Services, General. | 68 |
| 63 | Communication Disorders Sciences and Services. | 67 |
| 170 | International Business. | 66 |
| 37 | Biochemistry, Biophysics and Molecular Biology. | 65 |
| 172 | International/Global Studies. | 64 |
| 137 | Geography and Cartography. | 62 |
| 138 | Geological and Earth Sciences/Geosciences. | 62 |
| 43 | Biomedical/Medical Engineering. | 61 |
| 263 | Specialized Sales, Merchandising and Marketing Operations. | 61 |
| 161 | Industrial Engineering. | 57 |
| 256 | Social Sciences, General. | 57 |
| 52 | Business/Managerial Economics. | 56 |
| 5 | Agricultural Business and Management. | 55 |
| 60 | Clinical/Medical Laboratory Science/Research and Allied Professions. | 52 |
| 2 | Aerospace, Aeronautical and Astronautical Engineering. | 51 |
| 92 | Dietetics and Clinical Nutrition Services. | 51 |
| 181 | Linguistic, Comparative, and Related Language Studies and Services. | 51 |
The statistical analysis will be done for these Fields of Study. Histograms will be created based on Salary_10Y_4Y_Tui field. The histograms will be created for the Top 10 and Bottom 10 Fields of Study based on Financial Outcome using the this field.
FieldofStudy_Earnings_for_rep_avg = aggregate(Salary_10Y_4Y_Tui ~ CIPDESC,data= FieldofStudy_Earnings_for_rep, mean)
FieldofStudy_Earnings_for_rep_avg=FieldofStudy_Earnings_for_rep_avg[order(FieldofStudy_Earnings_for_rep_avg$Salary_10Y_4Y_Tui,decreasing=TRUE),]
#make sure they are popular
FieldofStudy_Earnings_for_rep_avg = FieldofStudy_Earnings_for_rep_avg[FieldofStudy_Earnings_for_rep_avg$CIPDESC %in% FieldofStudy_Earnings_for_rep_count[FieldofStudy_Earnings_for_rep_count$INSTNM>49,"CIPDESC"],]
#Mean Earnings all fields of study
mean(FieldofStudy_Earnings_for_rep_avg$Salary_10Y_4Y_Tui)
## [1] 312300.2
#Top 10
FieldofStudy_Earnings_for_rep_top10 = head(FieldofStudy_Earnings_for_rep_avg,10)
FieldofStudy_Earnings_for_rep_top10
| CIPDESC | Salary_10Y_4Y_Tui | |
|---|---|---|
| 71 | Computer Engineering. | 591431.1 |
| 103 | Electrical, Electronics and Communications Engineering. | 588569.1 |
| 73 | Computer Science. | 584529.4 |
| 161 | Industrial Engineering. | 562994.7 |
| 244 | Registered Nursing, Nursing Administration, Nursing Research and Clinical Nursing. | 549967.4 |
| 54 | Chemical Engineering. | 548370.8 |
| 195 | Mechanical Engineering. | 540788.6 |
| 2 | Aerospace, Aeronautical and Astronautical Engineering. | 537052.9 |
| 58 | Civil Engineering. | 503869.7 |
| 69 | Computer and Information Sciences, General. | 490814.4 |
#Bottom 10
FieldofStudy_Earnings_for_rep_bottom10 = tail(FieldofStudy_Earnings_for_rep_avg,10)
FieldofStudy_Earnings_for_rep_bottom10
| CIPDESC | Salary_10Y_4Y_Tui | |
|---|---|---|
| 242 | Radio, Television, and Digital Communication. | 198901.7 |
| 63 | Communication Disorders Sciences and Services. | 197985.9 |
| 250 | Rhetoric and Composition/Writing Studies. | 196935.4 |
| 114 | English Language and Literature, General. | 194206.0 |
| 181 | Linguistic, Comparative, and Related Language Studies and Services. | 188844.8 |
| 18 | Anthropology. | 175871.9 |
| 204 | Music. | 169351.9 |
| 126 | Fine and Studio Arts. | 159514.6 |
| 124 | Film/Video and Photographic Arts. | 132076.0 |
| 94 | Drama/Theatre Arts and Stagecraft. | 125056.0 |
The average Earnings over 10 years minus tuition is $312,300. The field of study with the highest average potential earning is almost double this amount. The one with the lowest earning power is almost 1/3 of the average accross all Fields of Study.
Histograms for Top 10.
library(ggplot2)
library(scales)
df_top10 = FieldofStudy_Earnings_for_rep[FieldofStudy_Earnings_for_rep$CIPDESC %in% FieldofStudy_Earnings_for_rep_top10$CIPDESC,]
# Change line color and fill color
ggplot(df_top10, aes(x=Salary_10Y_4Y_Tui))+ geom_histogram(color="darkblue", fill="lightblue") + scale_x_continuous(labels=dollar_format()) + facet_wrap(. ~ CIPDESC, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histograms for Bottom 10.
df_bottom10 = FieldofStudy_Earnings_for_rep[FieldofStudy_Earnings_for_rep$CIPDESC %in% FieldofStudy_Earnings_for_rep_bottom10$CIPDESC,]
# Change line color and fill color
ggplot(df_bottom10, aes(x=Salary_10Y_4Y_Tui))+ geom_histogram(color="darkblue", fill="lightblue") + scale_x_continuous(labels=dollar_format()) + facet_wrap(. ~ CIPDESC, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now find the Field of Studies with the greater variance in outcomes. This means, that there is an important financial difference depending on the institution for these fields of study.
FieldofStudy_Earnings_for_rep_sdev = aggregate(Salary_10Y_4Y_Tui ~ CIPDESC,data= FieldofStudy_Earnings_for_rep, sd)
FieldofStudy_Earnings_for_rep_sdev = FieldofStudy_Earnings_for_rep_sdev[order(FieldofStudy_Earnings_for_rep_sdev$Salary_10Y_4Y_Tui,decreasing=TRUE),]
#make sure they are popular
FieldofStudy_Earnings_for_rep_sdev = FieldofStudy_Earnings_for_rep_sdev[FieldofStudy_Earnings_for_rep_sdev$CIPDESC %in% FieldofStudy_Earnings_for_rep_count[FieldofStudy_Earnings_for_rep_count$INSTNM>49,"CIPDESC"],]
#Top 10
FieldofStudy_Earnings_for_rep_top10_sdev = head(FieldofStudy_Earnings_for_rep_sdev,10)
FieldofStudy_Earnings_for_rep_top10_sdev
| CIPDESC | Salary_10Y_4Y_Tui | |
|---|---|---|
| 14 | Allied Health Diagnostic, Intervention, and Treatment Professions. | 169354.6 |
| 73 | Computer Science. | 166158.7 |
| 71 | Computer Engineering. | 143954.6 |
| 69 | Computer and Information Sciences, General. | 128996.7 |
| 244 | Registered Nursing, Nursing Administration, Nursing Research and Clinical Nursing. | 122115.0 |
| 74 | Computer Software and Media Applications. | 116923.8 |
| 193 | Mathematics. | 109362.1 |
| 147 | Health Services/Allied Health/Health Sciences, General. | 106856.5 |
| 97 | Economics. | 103971.3 |
| 183 | Management Information Systems and Services. | 103022.1 |
For the field of studies in the previous table , it would be a good idea to check the Institutions that offer the best financial outcomes. The following are the histograms.
df_sdev10 = FieldofStudy_Earnings_for_rep[FieldofStudy_Earnings_for_rep$CIPDESC %in% FieldofStudy_Earnings_for_rep_top10_sdev$CIPDESC,]
# Change line color and fill color
ggplot(df_sdev10, aes(x=Salary_10Y_4Y_Tui))+ geom_histogram(color="darkblue", fill="lightblue") + scale_x_continuous(labels=dollar_format()) + facet_wrap(. ~ CIPDESC, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The following box plots focus on the 10 Field of Studies with the highest variance. It provides a view into the outliers.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#function to determine outliers
#Increase IQR from 1.5 times to 2.0 times to avoid many outliers
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 2.0 * IQR(x) | x > quantile(x, 0.75) + 2.0 * IQR(x))
}
#dat <- df_sdev10[,c("CIPDESC","INSTNM","Salary_10Y_4Y_Tui")] %>% group_by(CIPDESC) %>% mutate(is_outlier=ifelse(is_outlier(Salary_10Y_4Y_Tui),INSTNM,as.numeric(NA)))
#dat$INSTNM[which(is.na(dat$is_outlier))] <- as.numeric(NA)
#This ggplot view overlaps the names of many institutions
#ggplot(dat, aes(y=Salary_10Y_4Y_Tui, x=factor(CIPDESC))) + geom_boxplot() + geom_text(aes(label=INSTNM),na.rm=TRUE,nudge_y=0.5, cex=2, angle=45) + theme(axis.text.x =element_text(angle=45, hjust =1) )
Try yo use plotly to make it more readable
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- ggplot(df_sdev10, aes(y=Salary_10Y_4Y_Tui, x=factor(CIPDESC))) + geom_boxplot() + theme(axis.text.x =element_text(angle=45, hjust =1) )
#store description as factor
temp = as.factor(df_sdev10$CIPDESC)
df_sdev10$factorCIPDESC_Salary = paste(as.numeric(temp),df_sdev10$Salary_10Y_4Y_Tui,sep="_")
#create plotly object
ply = ggplotly(p, width = 1280, height = 960)
#maybe I can map ti with ply$x$data[[1]]$x and ply$x$data[[1]]$y
datafrom_ply = as.data.frame(cbind(ply$x$data[[1]]$x,ply$x$data[[1]]$y))
datafrom_ply$formatch = paste(datafrom_ply[,1],datafrom_ply[,2],sep="_")
#have to merge to align based on factor and dollar amount
#Add sequence to return to right order
datafrom_ply$seq = seq(1,dim(datafrom_ply)[1])
# there is duplicated data
#datafrom_ply[duplicated(datafrom_ply$formatch),]
#have to remove these duplicates from df_sdev10
#to avoid merging issues
df_sdev10_unique= df_sdev10[!duplicated(df_sdev10$factorCIPDESC_Salary),]
#merge changes the orger
datafrom_ply = merge(datafrom_ply, df_sdev10_unique[,c("factorCIPDESC_Salary","INSTNM","Salary_10Y_4Y_Tui","CIPDESC")], all.x=TRUE, all.y=FALSE, by.x="formatch", by.y="factorCIPDESC_Salary")
#reorder using the field seq
datafrom_ply=datafrom_ply[order(datafrom_ply$seq),]
#Now INSTNM should be ordered correctly
#Add field with dollar amoung and institution
datafrom_ply$INSTNM_Net_10 = paste(datafrom_ply$INSTNM,": $",datafrom_ply$Salary_10Y_4Y_Tui,sep="")
#Assign Labels to display
ply$x$data[[1]]$text = datafrom_ply$INSTNM_Net_10
ply$x$data[[1]]$hoverinfo <- "text"
ply
This is the end of the basic statistic analsysis. Based on this please check the data tables available to search for a specific field of study and the best institutions (from a a financial perspective) that offers it.
There is a more advanced statistical analysis that review financial outcome by other variables like location and words included in the Field of Study.